fix(pure): Make Pure FlashArray HTTP client timeout configurable #5551
rgolangh merged 2 commits into kubev2v:main
Conversation
```go
flag.StringVar(&vspherePassword, "vsphere-password", os.Getenv("GOVMOMI_PASSWORD"), "vSphere's API password")
flag.StringVar(&esxiCloneMethod, "esxi-clone-method", os.Getenv("ESXI_CLONE_METHOD"), "ESXi clone method: 'vib' (default) or 'ssh'")
flag.IntVar(&sshTimeoutSeconds, "ssh-timeout-seconds", 30, "SSH timeout in seconds for ESXi operations (default: 30)")
flag.IntVar(&storageAPITimeoutSeconds, "storage-api-timeout-seconds", 30, "HTTP client timeout in seconds for storage API requests (default: 30)")
```
We need to make the default `os.Getenv("STORAGE_HTTP_TIMEOUT_SECONDS")` instead of 30, and that will allow the configuration to be passed as part of the storage secret in the storageMap. Otherwise this is hard to use.
Also please add that entry in cmd/vsphere-xcopy-volume-populator/README.md under the STORAGE_ secret keys.
Both done in the latest commit 2a7932c:

- `storageAPITimeoutSeconds` is now a `StringVar` with `os.Getenv("STORAGE_HTTP_TIMEOUT_SECONDS")` as default, same pattern as the other `STORAGE_*` vars. `strconv.Atoi` handles the conversion at the call site with a warning log for bad values, and the `<= 0` guard in `NewRestClient` keeps the 30s fallback.
- Added `STORAGE_HTTP_TIMEOUT_SECONDS` to the secret keys table in the README.
the DCO check is failing - please add your git signature
force-pushed d36ce72 to 6b5eabf
Done
cmd/vsphere-xcopy-volume-populator/vsphere-xcopy-volume-populator.go (outdated; resolved)
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #5551       +/-  ##
==========================================
- Coverage   15.45%   10.10%    -5.35%
==========================================
  Files         112      500     +388
  Lines       23377    57429   +34052
==========================================
+ Hits         3613     5804    +2191
- Misses      19479    51144   +31665
- Partials      285      481     +196
```
```go
case forklift.StorageVendorProductPureFlashArray:
	apiTimeout, err := strconv.Atoi(storageAPITimeoutSeconds)
	if err != nil && storageAPITimeoutSeconds != "" {
		klog.Warningf("invalid value %q for storage-api-timeout-seconds, using default (30s): %v", storageAPITimeoutSeconds, err)
```
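The truncated hunk above (parse the string flag, warn on bad input, fall through to the default) can be sketched in full as a small helper. This is an illustration under assumptions: `parseTimeoutSeconds` is a hypothetical name, stdlib `log` stands in for `klog`, and returning a non-positive value is what lets `NewRestClient`'s `<= 0` guard apply the 30s default.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
)

// parseTimeoutSeconds converts the string flag value to an int. An empty
// or invalid value yields 0, so the <= 0 guard downstream falls back to
// the 30s default. The warning names the flag, as the review asked.
// (Hypothetical helper; stdlib log stands in for klog here.)
func parseTimeoutSeconds(raw string) int {
	if raw == "" {
		return 0 // unset: let NewRestClient use its 30s default
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		log.Printf("invalid value %q for storage-api-timeout-seconds, using default (30s): %v", raw, err)
		return 0
	}
	return n
}

func main() {
	fmt.Println(parseTimeoutSeconds("45"))  // 45
	fmt.Println(parseTimeoutSeconds("abc")) // 0, with a warning logged
}
```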
In the warning, change it to the new flag name
We are close; there's another small comment, and please also pull/rebase.
…GE_HTTP_TIMEOUT_SECONDS Resolves: None Signed-off-by: Michael Jons <Michael.Jons@tre.se>
force-pushed 93f44aa to 9b43830
Done
Resolves: None Signed-off-by: Michael Jons <Michael.Jons@tre.se>
/backport release-2.11
✅ PR #5551 backported to |
…nfigurable (#5615)

**Backport:** #5551

**Make Pure FlashArray HTTP client timeout configurable**

**Problem:** During migrations of VMs with many disks, simultaneous `CopyVolume` requests to Pure FlashArray were timing out, leaving PVCs stuck in `Pending`. In one observed case, 15 disks were migrated but only 7 reached `Bound` status; the remaining 8 populator pods failed with:

```
failed to copy VMDK using VVol storage API: copy operation failed: Pure FlashArray CopyVolume failed: failed to send copy volume request: Post "https://<array>/api/2.46/volumes?overwrite=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

The root cause is that the HTTP client timeout was hardcoded to 30 seconds with no way to extend it, making it impossible to accommodate slower or heavily-loaded arrays.

**Changes:**
- `NewRestClient` now accepts an `httpTimeoutSeconds int` parameter instead of a hardcoded value. A value of `<= 0` falls back to the 30s default.
- `NewFlashArrayClonner` threads the parameter through to `NewRestClient`.
- A `--storage-api-timeout-seconds` CLI flag (default: `30`) is added to the `vsphere-xcopy-volume-populator` binary.

**How to configure:** Pass `--storage-api-timeout-seconds=<value>` to the populator binary. Full operator-side wiring (CRD field → `VSphereXcopyPluginConfig` → `VSphereXcopyVolumePopulatorSpec` → populator-controller pod args) is a follow-up.

**Default behaviour is unchanged**: the timeout remains 30 seconds unless explicitly overridden.

Signed-off-by: Michael Jons <Michael.Jons@tre.se>
Co-authored-by: Michael Jons <Michael.Jons@tre.se>
Make Pure FlashArray HTTP client timeout configurable

**Problem:**
During migrations of VMs with many disks, simultaneous `CopyVolume` requests to Pure FlashArray were timing out, leaving PVCs stuck in `Pending`. In one observed case, 15 disks were migrated but only 7 reached `Bound` status; the remaining 8 populator pods failed with:

```
failed to copy VMDK using VVol storage API: copy operation failed: Pure FlashArray CopyVolume failed: failed to send copy volume request: Post "https://<array>/api/2.46/volumes?overwrite=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

The root cause is that the HTTP client timeout was hardcoded to 30 seconds with no way to extend it, making it impossible to accommodate slower or heavily-loaded arrays.

**Changes:**
- `NewRestClient` now accepts an `httpTimeoutSeconds int` parameter instead of a hardcoded value. A value of `<= 0` falls back to the 30s default.
- `NewFlashArrayClonner` threads the parameter through to `NewRestClient`.
- A `--storage-api-timeout-seconds` CLI flag (default: `30`) is added to the `vsphere-xcopy-volume-populator` binary.

**How to configure:**
Pass `--storage-api-timeout-seconds=<value>` to the populator binary. Full operator-side wiring (CRD field → `VSphereXcopyPluginConfig` → `VSphereXcopyVolumePopulatorSpec` → populator-controller pod args) is a follow-up.

**Default behaviour is unchanged**: the timeout remains 30 seconds unless explicitly overridden.